Last updated: 2023-06-07
Checks: 6 1
Knit directory: DHT_fibroblast_snRNAseq/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R
Markdown file created these results, you’ll want to first commit it to
the Git repo. If you’re still working on the analysis, you can ignore
this warning. When you’re finished, you can run
wflow_publish to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20230528) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 2a24267. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/mf43/
Ignored: data/mf61/
Ignored: data/sca.rds
Ignored: output/fibroblast_seurat.rds
Untracked files:
Untracked: .DS_Store
Untracked: analysis/Differential_Expression.Rmd
Untracked: analysis/Preprocess_fibro_from10x.Rmd
Untracked: analysis/Subtype_Prediction.Rmd
Untracked: analysis/make_gsNetwork.R
Untracked: analysis/plot_commu_temp.R
Untracked: data/.DS_Store
Untracked: data/AR_sig_down.csv
Untracked: data/AR_sig_up.csv
Untracked: data/Wu_2020/
Untracked: data/genesGR.rds
Untracked: output/Table_Chap5_Supp3.xlsx
Untracked: output/pred_hpca.rds
Untracked: output/pred_wu.rds
Untracked: output/zlm_summary_mixed.rds
Unstaged changes:
Modified: .gitignore
Modified: README.md
Modified: analysis/_site.yml
Deleted: analysis/about.Rmd
Modified: analysis/index.Rmd
Deleted: analysis/license.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish() to start tracking its development.
library(tidyverse)
library(yaml)
library(scales)
library(pander)
library(glue)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(cowplot)
library(ggfortify)
library(magrittr)
library(cqn)
library(ggrepel)
library(DT)
library(Seurat)
library(corrplot)
library(plotly)
library(patchwork)
library(colorspace)
library(ggpubr)
library(randomcoloR)
library(ggforce)
library(MAST)
library(pheatmap)
library(BiocParallel)
library(patchwork)
library(biomartr)
panderOptions("table.split.table", Inf)
panderOptions("big.mark", ",")
theme_set(
theme_bw() +
theme(
panel.background = element_blank(),
panel.grid = element_blank()
))
Color palettes used by most scRNA-seq methods could be found here.
# sub_cols <- c("#729ECE", "#FF9E4A", "#67BF5C", "#ED665D", "#AD8BC9") %>%
# set_names(unique(seurat[[]]$prediction))
patient_col <- c(
"MF61" = "#A71B4B",
"MF43" = "#584B9F"
)
# scanpy colors were found in https://github.com/scverse/scanpy/blob/034ca2823804645e0d4874c9b16ba2eb8c13ac0f/scanpy/plotting/palettes.py
treat_col <- c(
VEH = rgb(0.7, 0.7, 0.7),
DHT = rgb(0.8, 0.2, 0.2))
sample_col <- hcl.colors(n = 4) %>%
set_names(c("mf43DHT", "mf43VEH", "mf61DHT", "mf61VEH"))
ah <- AnnotationHub() %>%
subset(rdataclass == "EnsDb") %>%
subset(str_detect(description, "101")) %>%
subset(genome == "GRCh38")
stopifnot(length(ah) == 1)
ensDb <- ah[[1]]
genesGR <- read_rds(here::here("data/genesGR.rds"))
Outputs of the 10X cellranger pipelines were stored in
different directories for cells derived from each of the four samples.
Those raw reads were read in using the Read10X() function
and made into individualSeurat object with the
min.cells parameter set to 3 and min.features
parameter set to 200 so only features detected in at least 3 cells and
cells with at least 200 features within each sample were kept.
The 4 individual Seurat object were then combined to one
using the merge() function. The merge step simply
concatenates the counts and cell metadata together.Unique gens will also
be added to the merged objects.
mf43DHT_rep1 <- Read10X(data.dir = "~/pilot_fibroblast_scRNAseq/data/mf43/MF43_1/MF43_DHT-1/")
mf43VEH_rep1 <- Read10X(data.dir = "~/pilot_fibroblast_scRNAseq/data/mf43/MF43_1/MF43_VEH-1/")
mf61DHT <- Read10X(data.dir = "~/pilot_fibroblast_scRNAseq/data/mf61/MF61D1-G/")
mf61VEH <- Read10X(data.dir = "~/pilot_fibroblast_scRNAseq/data/mf61/MF61V1-G/")
mf43DHT <- CreateSeuratObject(counts = mf43DHT_rep1, project = "mf43DHT", min.cells = 3, min.features = 200)
mf43VEH <- CreateSeuratObject(counts = mf43VEH_rep1, project = "mf43VEH", min.cells = 3, min.features = 200)
mf61DHT <- CreateSeuratObject(counts = mf61DHT, project = "mf61DHT", min.cells = 3, min.features = 200)
mf61VEH <- CreateSeuratObject(counts = mf61VEH, project = "mf61VEH", min.cells = 3, min.features = 200)
fibroblast <- merge(mf43DHT, y = c(mf43VEH, mf61DHT, mf61VEH), project = "pilot_fibroblast")
The merged object contained data for 27381 features across 29128 cells.
fibroblast[["percent.mt"]] <- PercentageFeatureSet(fibroblast, pattern = "^MT-")
The QC metrics were plotted as a violin plot.
QCplot <- VlnPlot(fibroblast,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
cols = sample_col,
log = TRUE,
ncol = 3,
group.by = "orig.ident",
combine = FALSE,
pt.size = 0)
QCplot <- lapply(QCplot, function(x){x + guides(fill=guide_legend("Sample"))})
names(QCplot) <- c("nFeature_RNA", "nCount_RNA", "percent.mt")
(QCplot$nFeature_RNA & labs(y = "# of unique features/cell", x = "", title = "") |
QCplot$nCount_RNA & labs(y = "# of gene-wise counts/cell", x = "", title = "") |
QCplot$percent.mt & labs(y = "% of Mitochondrial counts/cell", x = "", title = "")) +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = 'A')
Numers of features, total numbers of counts and percentages of mitocondrial counts for cells with in each sample.
Cells that have unique feature counts over 4,000 or less than 200 and ones have >5% mitochondrial counts were filtered out.
fibroblast <- subset(fibroblast, subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 5)
26491 cells passed the filtration criteria.
fibroblast$Sample <- fibroblast[[]]$orig.ident
fibroblast$patient <- ifelse(str_detect(fibroblast[[]]$orig.ident, "mf43"), "MF43", "MF61")
fibroblast$Treatment <- ifelse(str_detect(fibroblast[[]]$orig.ident, "DHT"), "DHT", "VEH")
Significantly higher numbers of cells were derived from the two MF61 samples.
NumOfCells <- fibroblast[[]] %>%
group_by(Treatment, patient, Sample) %>%
summarise(n = n()) %>%
ggplot(
aes(Sample, n, fill = Treatment)
) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=n), vjust=1.6, color="white",
position = position_dodge(0.9), size=15) +
theme_minimal() +
labs(x = "",
y = "Number of Cells") +
# theme(
# axis.text.x = element_text(size = 12, angle = 90)
# ) +
scale_fill_manual(values = treat_col)
NumOfCells
Numbers of cells in each samples, colored by treatment.Considerably more cells were from MF61.
Normalisation was performed to normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
fibroblast <- NormalizeData(fibroblast)
Feature selection was performed on the normalised object to only focus on the 2000 genes with top cell-to-cell variation in downstream analyses, and the 10 most highly variable genes were plotted on the variance-expression plot below.
fibroblast <- FindVariableFeatures(fibroblast, selection.method = "vst", nfeatures = 2000)
marker <- head(VariableFeatures(fibroblast), 10)
plot1 <- VariableFeaturePlot(fibroblast)
plot2 <- LabelPoints(plot = plot1, points = marker, repel = TRUE)
plot2
Variance-expression relationship for all genes where the top 2000 highly variable features were colored by red and the gene names of the top 10 features were labelled by text
Next, the ScaleData() function was used to:
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
all.genes <- rownames(fibroblast)
fibroblast <- ScaleData(fibroblast, features = all.genes)
PCA was then performed on the scaled data.
fibroblast <- RunPCA(fibroblast, features = VariableFeatures(object = fibroblast))
DimPlot(fibroblast,
reduction = "pca",
# group.by = "Treatment",
pt.size = 0.1
)
PCA plots of all fibroblast cells from the pilot study
To choose how many PCA components to keep, DimHeatmap()
function was firstly used to visualise the source of heterogeneity in a
dataset, where cells and features were ordered according to their PCA
scores for each PCA component.
DimHeatmap(fibroblast, dims = 1:20, cells = 500, balanced = TRUE)

Then the elbow plot visualising a ranking of principle components based on the percentage of variance explained by each one was plotted.
ElbowPlot(fibroblast)

Based on the plots above, I chose to use the first 18 PCs.
Cells were clustered using seurat’s
FindNeighbors() and FindClusters() function
with dims set to 18 and resolution set to
0.3.
fibroblast <- FindNeighbors(fibroblast, dims = 1:18)
fibroblast <- FindClusters(fibroblast, resolution = 0.3)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 26491
Number of edges: 804730
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9064
Number of communities: 7
Elapsed time: 4 seconds
7 unique clusters were identified.
Uniform manifold approximation and projection was perform for the cells and the dimensional reduction result was visualized.
fibroblast <- RunUMAP(fibroblast, dims = 1:15)
UMAP_clu <- DimPlot(fibroblast, reduction = "umap")
UMAP_clu
UMAP plot colored by clusters formed.
The pre-processed Seurat object was saved as an
RDS file.
fibroblast$cluster <- Idents(fibroblast)
saveRDS(fibroblast, file = here::here("output/fibroblast_seurat.rds"))
The UMAP plot was also colored by patient and treatment.
umap_p <- DimPlot(fibroblast,
reduction = "umap",
group.by = "patient",
pt.size = 0.1) +
scale_color_manual(
values = patient_col
)+
ggtitle("")
umap_t <- DimPlot(fibroblast,
reduction = "umap",
group.by = "Treatment",
pt.size = 0.1
) +
scale_color_manual(
values = treat_col
) +
ggtitle("")
UMAP_byPatient <- plot_grid(
umap_p, umap_t ,
labels = LETTERS[2:3],
hjust = -1.5,
vjust = 2
)
UMAP_byPatient
UMAP plot colored by (L) patient and (R) treatment. The confounding effect of patient was still really strong.
num_cell <- fibroblast[[]] %>%
group_by(cluster, patient, Treatment) %>%
summarise(
n = n() )%>%
ungroup() %>%
group_by(cluster) %>%
mutate(total = sum(n))
pie_ls <- num_cell %>%
group_by(patient, cluster) %>%
mutate(n_pass = sum(n)) %>%
ungroup() %>%
mutate(prop_pass = n_pass/total) %>%
dplyr::select(cluster, patient, prop_pass) %>%
unique() %>%
split(f = .$cluster) %>%
lapply(function(x){
ggplot(
x, aes(x = "", y = prop_pass, fill = patient)
) +
geom_bar(width = 1, stat = "identity") +
scale_fill_manual(values = patient_col, name = "Patient") +
coord_polar("y", start=0) +
theme_void()
})
plot_grid(
plot_grid(
plotlist = pie_ls %>%
lapply(function(x){
x + theme(
legend.position = "none"
)
}),
labels = paste("Cluster ", 0:6, sep = ""),
label_size = 12
),
get_legend(pie_ls[[1]]),
rel_widths = c(9,1))
Propotion of cells derived from each patient for each of the 6 cluster. Cluster 1 was mostly MF43 cells while all the other 5 cluster were predominantly MF61 cells
pie_ls2 <- num_cell %>%
group_by(Treatment, cluster) %>%
mutate(n_treat = sum(n)) %>%
ungroup() %>%
mutate(prop_treat = n_treat/total) %>%
dplyr::select(cluster, Treatment, prop_treat) %>%
unique() %>%
split(f = .$cluster) %>%
lapply(function(x){
ggplot(
x, aes(x = "", y = prop_treat, fill = Treatment)
) +
geom_bar(width = 1, stat = "identity") +
scale_fill_manual(values = treat_col) +
coord_polar("y", start=0) +
theme_void()
})
plot_grid(
plotlist = pie_ls2,
labels = paste("Cluster ", 0:5, sep = "")
)
Propotion of cells derived from each traetment group for each of the 6 cluster. Apart from cluster 1, there are more vehicle cells than DHT-treated one in all clusters.
Positive markers for every cluster compared to all remaining cells
were found using the FindAllMarkers() function with
min.pct set to 0.3 and logfc.threshold set to
1.2. The numbers of markers found for each cluster were:
cluster_marker <- FindAllMarkers(
fibroblast, min.pct = 0.3, logfc.threshold = 1.2, only.pos = TRUE
)
cluster_marker %>%
split(f = .$cluster) %>%
vapply(nrow, integer(1)) %>%
pander()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 2 | 10 | 11 | 8 | 2 | 1 | 13 |
Genes included in the heatmap are shown in the table below.
cluster_marker <- cluster_marker %>%
dplyr::select(cluster, gene_name = gene) %>%
left_join(
genes(ensDb) %>%
as.data.frame() %>%
dplyr::select(
gene_name, description,
gene_id
) %>%
mutate(
description = gsub("\\[.*?\\]", "", description)
)
) %>%
unique()
# marker_GO <- getGO(
# organism = "Homo Sapiens",
# genes = cluster_marker$gene_id,
# filters = "ensembl_gene_id"
# )
cluster_marker %>%
dplyr::select(-gene_id) %>%
# dplyr::filter(cluster == "1") %>%
# pull(gene_name)
unique() %>%
mutate_all(as.factor) %>%
datatable(
filter = "top",
options = list(scrollY = '450px')
)
Expressions of all marker genes selected for each cluster combined were visualised as a heatmap.
marker_hp <- DoHeatmap(
fibroblast,
features = cluster_marker$gene_name,
# label = FALSE,
group.by = "cluster",
angle = 0,
hjust = 0.5,
size = 24,
group.bar.height = 0.05)
marker_hp
Expressions of marker genes selected for each cluster were visualised as a heatmap
Expressions of the marker genes for each cluster were visualised
using the FeaturePlot() function.
marker_list <- cluster_marker %>%
dplyr::select(cluster, gene_name) %>%
split(f = .$cluster) %>%
lapply(pull, gene_name) %>%
lapply(unique)
names(marker_list) <- paste("Cluster", names(marker_list), sep = " ")
marker_pl <- sapply(names(marker_list), function(x){
FeaturePlot(
fibroblast, features = marker_list[[x]]
)
}, simplify = FALSE)
cluster_cols <- hue_pal()(length(marker_pl))
names(cluster_cols) <- names(marker_pl)
UMAP_pl <- sapply(names(cluster_cols), function(x){
cols <- ifelse(names(cluster_cols) == x, cluster_cols[[x]], "gray")
DimPlot(
fibroblast,
reduction = "umap",
cols = cols
)
}, simplify = FALSE)
cluster_ls <- sapply(names(marker_pl), function(x){
plot_grid(
marker_pl[[x]],
UMAP_pl[[x]],
nrow = 1,
rel_widths = c(2, 1))
}, simplify = FALSE)
htmltools::tagList(
lapply(names(cluster_ls), function(y){
cp <- htmltools::tags$em(glue(
"Expressions of the top 10 marker genes selected for cluster {y} on UMAP plots with a UMAP highlighting where cluster {y} cells locate. "
))
htmltools::div(
id = y,
class="section level4",
htmltools::h4(class = "tabset", y),
htmltools::div(
class = "figure", style = "text-align: center",
htmltools::plotTag(cluster_ls[[y]], alt = "test", width = 1500, height = 1000),
htmltools::p(
class = "caption", htmltools::tags$em(cp)
)
)
)
})
)
Expressions of the top 10 marker genes selected for cluster Cluster 0 on UMAP plots with a UMAP highlighting where cluster Cluster 0 cells locate.
Expressions of the top 10 marker genes selected for cluster Cluster 1 on UMAP plots with a UMAP highlighting where cluster Cluster 1 cells locate.